1. Introduction 

The determination of food composition is fundamental to theoretical and applied 
investigations in food science and technology, and is often the basis for establishing 
nutritional value and overall acceptance from the consumer standpoint. Most of the methods 
described in a previous report [1] (Baianu et al., 2011a) are useful for the conventional analysis of 
foods, that is, the determination of the major components (proteins, lipids, moisture, 
carbohydrates, and minerals). These components are included in standard tables of food 
composition. Advances in food analysis in the last three decades have resulted from the 
development of many instrumental methods such as NIR and from the improvements in 
separation methods (mainly chromatography). 

The analyst often assumes that the sample to be analyzed is homogeneous. It is advisable 
that before starting a determination, the whole sample be mixed to eliminate heterogeneity - 
mainly in particle size and moisture distribution (Pomeranz and Meloan 1994). In some foods 
like concentrated sugar solutions, the sample must be heated carefully to dissolve sugar crystals. 

Why would one wish to analyze the composition of soy and other health foods? 
Because soy and other health foods are important for lowering cholesterol and for the prevention or 
treatment of atherosclerosis and coronary heart disease. Soy food composition is also important 
for weight loss and weight control (Liu et al., 1995). Therefore, quality control and routine 
monitoring of the composition of soy and other health foods are important to consumers. Monitoring 
the levels of isoflavones in health foods such as soymilk also appears to be important in 
populations that are at risk for certain types of cancers. Rapid, accurate, and cost-effective 
composition analyses of soyfoods and other health foods are essential for improving the 
efficiency and quality of health food production. This is the first attempt at developing Fourier 
Transform Near Infrared Reflectance Spectroscopy (FT-NIRS) calibrations for soy and other 
health foods. 

Soy tofu is a traditional soyfood that originated in China (Liu et al., 1995). During the 
course of soybean cultivation, the Chinese gradually transformed soybeans into various 
forms of soyfoods, including tofu, soymilk, soy paste, soy sauce and soy sprouts. Along with 
soybean cultivation, methods of soyfood preparation gradually spread to countries of the Far East 
and the West. The art of preparing soyfoods has now spread to the rest of the world, due to 
agricultural innovation and cultural exchanges. For the past several decades, advances in soybean 
chemistry and innovation in processing and packaging technology have dramatically modernized 
traditional ways of preparing soyfoods. As new medical research unveils the health benefits of 
soyfoods, such as the benefits of isoflavones for women's health, there is no doubt that soyfoods 
will soon become a part of global culture. 

It is well known that protein is the dominant component in tofu. An early report (Koga 
et al., 1992) found that the NIR spectrum of tofu in the 1100 to 2500 nm region was correlated 
with moisture, crude protein, and fiber contents determined by standard chemical methods, with 
correlation coefficients of 0.976, 0.830, and 0.865, respectively. Other researchers studied 
the contributions of the total soybean proteins and of the storage proteins [glycinin (11S) and 
β-conglycinin (7S) fractions] to tofu yield and texture. They analyzed protein contents by using 
sodium dodecyl sulfate polyacrylamide gel electrophoresis (SDS-PAGE) coupled with densitometry, 
and reversed phase-high performance liquid chromatography (RP-HPLC) (Mujoo et al., 2003). 
In order to measure the soy protein in gels directly, rapidly and accurately, a special tofu 
calibration was developed with a Spectrum One NTS FT-NIRS instrument. 

Soymilk is another popular soyfood, a liquid in which protein, carbohydrates and water are 
the three main components (Liu et al., 1995). Protein content in soymilk is usually determined by 
conventional methods such as chemical analysis and UV-Vis spectroscopy (Nielsen, 
1994). In previous research using capillary electrophoresis, bovine whey proteins were 
quantified in commercial powdered soybean milk that had bovine whey added to its formulation, 
using the external standard calibration method (Garcia-Ruiz et al., 1999). These techniques 
are either time-consuming or not accurate enough for practical applications. A novel calibration 
was thus developed here with the Spectrum One NTS FT-NIR instrument to accurately measure 
protein, fat and carbohydrate contents in soymilk. For this purpose, a transflectance working 
mode was employed for spectral data acquisition of soymilk. This mode is usually used for thin 
layer samples in order to reduce the noise level and baseline shift of spectra. If the NIR spectra 
of liquid samples such as milk are obtained with the regular transmittance or reflectance mode, 
accurate quantitation is almost impossible because of the low S/N ratio caused by light scattering 
and the large baseline shift (Ozaki et al., 2001). 

A high dietary intake of soya has been associated with a reduced risk of heart disease and of 
some cancers, such as breast cancer in women. Isoflavones (mainly daidzein, 
genistein and genistin) may be responsible for the protective role of soya (Liu 1997; Song et al. 
1998). Monitoring the levels of isoflavones in health foods such as soymilk also appears to be 
important in populations that are at risk for certain types of cancers (Liu et al. 1995). Rapid, 
accurate, and cost-effective composition analyses of soy isoflavones are essential for breeding 
and genetic selection studies aimed at optimizing soybean seed compositions for human health 
food applications (Choi et al. 2000; Lee et al. 2003), and for improving the efficiency and quality of 
soy health food production. The determination of isoflavone content is commonly done by 
HPLC analysis (Carrao et al. 2002; Choi et al. 2000; de-Rijke et al. 2001; Lee et al. 2003; Song 
et al. 1998; Tekel et al. 1999), or by other improved methods with regular liquid chromatography 
(Kao et al. 2002). The HPLC method for isoflavone measurement is expensive, time-consuming, 
and impractical for measurements of the large numbers of soybean samples that are required by 
breeding and selection studies. 

Few NIR studies, however, have been reported on the analysis of one or two seeds, and 
no work has been published on the measurement of low-level components such as isoflavones, 
mainly because of the limited spectral resolution and stability of conventional NIR instruments. 
In the past five years, significant improvements in NIR instrumentation have been achieved 
through applications of novel technologies such as Diode Array and Fourier Transform (Guo et 
al., 2002), which provide the potential for single seed analysis of both major 
components and low-level components of soybeans. In this chapter, rapid and accurate analytical 
methods for protein, oil, moisture, and isoflavone determinations were developed with 
state-of-the-art FT-NIR instruments. This is the first attempt at developing Fourier Transform Near 
Infrared Reflectance Spectroscopy (FT-NIRS) calibrations for isoflavones in soybeans. 

2. Calibration and validation methods 

In this section partial least-squares regression models are employed to develop FT-NIR 
calibrations for soy and other health foods, soy tofu and milk, as well as soy isoflavones. 



The general procedures for calibration development for the Spectrum One NTS can be 
described as: 

1. Data acquisition with standard calibration samples. 

2. Wavelength range selection suitable for sample composition determination, based 
on effective absorption bands. 

3. Baseline correction of the spectra with the "Interactive Baseline Correction" function, 
followed by normalization of the spectra. 

4. Matrix calculations by the PLS-1 algorithm in order to optimize the calibration 
parameters, after correction for light scattering effects in the raw spectra by the MSC method. 

5. Generate a calibration file with the optimized calibration parameters and make it 
ready for sample measurement. 

In order to improve the accuracy and robustness of the calibration, spectral data 
sets based on equally wide concentration ranges of all components were taken to statistically 
maximize the information content in the spectra (Haaland and Thomas, 1988). Although the 
calibration algorithm was designed for PLS-1 simulations with a three component mixture 
system, it is applicable to real samples with multiple components. Thus, wide concentration 
ranges of all components in standard samples are necessary for high quality calibration 
development. Although the concentration ranges of all components for real systems may not be 
comparable, it is necessary to make the concentration range of each component as wide as 
possible. 

2.1. Calibration algorithm 

2.1.1. Determining the Number of Factors for the Model 

PLS-1 is a reduced subset of the full PLS-2 algorithm. The algorithms have been 
combined here, with appropriate notes on where they differ. Note that a PLS-2 model of a training 
set with only one constituent is identical to a PLS-1 model for the same data. One of the more 
subtle tasks in using PCR and PLS is choosing the correct number of loading vectors (factors) to 
use to model the data. As more and more vectors are calculated, they are ordered by their degree 
of importance to the model (either by variance in PCA or by concentration-weighted variance in 
PLS). Eventually the loading vectors will begin to model the system noise (which usually 
provides the smallest contribution to the data). 

The earlier vectors in the model are most likely to be the ones related to the constituents 
of interest, while later vectors generally have less information that is useful for predicting 
concentration. In fact, if these later vectors are included in the model, the predictions can actually be 
worse than if they were ignored altogether. Thus, decomposing spectra with these techniques and 
selecting the correct number of loading vectors is a very effective way of filtering out noise. 

However, if too few vectors are used to construct the model, the prediction accuracy for 
unknown samples will suffer since not enough terms are being used to model all the spectral 
variations that compose the constituents of interest. Therefore, it is very important to define a 
model that contains enough vectors to properly model the components of interest without adding 
too much contribution from the noise. 

Models that include noise vectors or more vectors than are actually necessary to predict 
the constituents' concentrations are called overfit. Models that do not have enough factors in 
them are known as underfit. 

Unfortunately, there is usually no clear indicator of how many factors are required to 
move from "constituent" vectors into "noise" vectors and prevent both underfitting and 
overfitting. However, there are a variety of methods that can be used to aid in determining this 
value. One of the most effective is to calculate the PRESS (Prediction Residual Error Sum of 
Squares) for every possible number of factors. This is calculated by building a calibration model 
with a given number of factors, then predicting some samples of known concentration (usually the 
training set data itself) against the model. The sum of the squared differences between the 
predicted and known concentrations gives the PRESS value for that model. 

$$\mathrm{PRESS} = \sum_{i=1}^{n} \sum_{j=1}^{m} \left( Cp_{ij} - C_{ij} \right)^2 \qquad \text{Eq. (2.1)}$$

In the above equation, n is the number of samples in the training set, and m is the 
number of constituents. Cp is the matrix of predicted sample concentrations from the model, and 
C is the matrix of known concentrations of the samples. The smaller the PRESS value, the better 
the model is able to predict the concentrations of the calibrated constituents. By calculating the 
PRESS value for models with each possible number of factors and plotting the results, a very clear 
trend should emerge. 
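
As a small illustration of Eq. (2.1), the double sum over samples and constituents can be written directly. This is only a sketch: the function name `press` and the concentration values are hypothetical and are not taken from the chapter's data.

```python
def press(predicted, known):
    """PRESS: sum of squared differences between predicted and known values.

    Both arguments are n-by-m nested lists (n samples, m constituents).
    """
    return sum(
        (cp - c) ** 2
        for row_p, row_c in zip(predicted, known)
        for cp, c in zip(row_p, row_c)
    )


# Hypothetical example: 3 samples, 2 constituents (say protein % and moisture %).
known = [[25.0, 0.5], [43.0, 1.0], [20.0, 59.5]]
predicted = [[24.0, 0.5], [43.0, 2.0], [22.0, 59.5]]
press_value = press(predicted, known)
print(press_value)
```

Repeating this calculation for models built with 1, 2, 3, ... factors gives the PRESS curve discussed in the text.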



2.1.2. Cross validation 

Cross validation is conceptually very simple to understand, but it is also the most 
computationally intensive method of optimizing a model. In effect, cross validation attempts to 
emulate predicting "unknown" samples by using the training set data itself. The procedure is as 
follows: 

1. Select a sample (or a small group of samples, if the training set is large enough) and remove 
the spectrum (spectra) and corresponding concentration data from the data matrix. Set the 
factor counter to I = 1. 

2. Use the remaining spectra and concentration data of the samples to perform the 
decomposition and calibration calculations for factor I (loading factor). 

3. Predict the concentrations of the removed sample(s) using the calibration equation from step 
2, and calculate PRESS(I). 

4. Increase the factor counter (I = I + 1) and repeat from step 2 until all desired factors (I = f) have 
been calculated and predicted. 

5. Place the previously left out sample data back into the training set and select a different 
sample (or group). Return to step 1 and repeat the calculations. As each sample is left out, 
add the calculated squared residual error to all the previous PRESS values. Repeat until all 
samples have been left out and predicted at least once. 
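
The five steps above can be sketched as a leave-one-out loop. The sketch below is an illustration under stated assumptions, not the procedure of any particular software package: the stand-in model is a plain principal component regression (`pcr_fit_predict`, a hypothetical helper), and the "spectra" are synthetic two-band curves generated for the example.

```python
import numpy as np


def pcr_fit_predict(A_train, c_train, A_test, n_factors):
    """Stand-in model (an assumption): principal component regression."""
    a_mean, c_mean = A_train.mean(axis=0), c_train.mean()
    # Rows of Vt are loading vectors, ordered by the variance they explain.
    _, _, Vt = np.linalg.svd(A_train - a_mean, full_matrices=False)
    V = Vt[:n_factors].T
    scores = (A_train - a_mean) @ V
    b, *_ = np.linalg.lstsq(scores, c_train - c_mean, rcond=None)
    return (A_test - a_mean) @ V @ b + c_mean


def loo_press(A, c, max_factors, fit_predict=pcr_fit_predict):
    """Leave-one-out PRESS(I) for I = 1..max_factors."""
    n = A.shape[0]
    press = np.zeros(max_factors)
    for i in range(n):                        # steps 1 and 5: rotate samples out
        keep = np.arange(n) != i
        for f in range(1, max_factors + 1):   # steps 2 to 4: loop over factors
            pred = fit_predict(A[keep], c[keep], A[i:i + 1], f)
            press[f - 1] += (pred[0] - c[i]) ** 2
    return press


# Synthetic two-component "spectra" (hypothetical data, not from the chapter).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
s1 = np.exp(-((x - 0.3) / 0.05) ** 2)         # band of the constituent of interest
s2 = 3.0 * np.exp(-((x - 0.7) / 0.05) ** 2)   # stronger band of a second constituent
c1, c2 = rng.uniform(0, 1, 12), rng.uniform(0, 1, 12)
A = np.outer(c1, s1) + np.outer(c2, s2) + rng.normal(0, 1e-3, (12, 50))
press_values = loo_press(A, c1, max_factors=5)
print(press_values)
```

Because every model is rebuilt once per left-out sample and per factor count, the cost grows with both, which is the computational burden described in the text.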

There are two main advantages of cross-validation over all other methods. The first is in 
how it estimates the performance of the model. Since the predicted samples are not the same as 
the samples used to build the model, the calculated PRESS value is a very good indication of the 
error in the accuracy of the model when used to predict "unknown" samples in the future. The 
larger the training set and the smaller the groups of samples left out in each pass (optimally only 
one sample at a time, but this can be very time consuming), the better this estimate will be. In 
effect, the model is validated with a large number of "unknown" samples (since each training 
sample is left out at least once) without having to measure an entirely new set of data. 

The second benefit of cross validation is better outlier detection. While this will be 
discussed in more depth in a later section, it can be mentioned that cross validation is the only 
validation method that can give complete outlier detection for the training set data. Since each 
sample is left out of the models during the cross validation process, it is possible to calculate how 
well the spectrum matches the model by calculating the spectral reconstruction and comparing it 
to the original training spectrum (via the spectral residual). If the predicted concentrations for a 
single sample are far from the known values and the spectrum does not match the model well, while 
the rest of the data fit well, the sample is possibly an outlier. Identifying and removing outlier 
samples from the training set should improve the predictive ability of the model. Outlier 
detection on the training set data is reliable only if a complete cross validation is performed. 
Unfortunately, cross validation is a very time consuming process. It requires 
recalculating the models for every sample left out. However, there are a few somewhat 
acceptable shortcuts. If the number of samples in the training set is large enough, the number of 
samples rotated out in each pass can be more than one. This does not give the best 
statistics for each sample, but it does speed up the calculations and can be acceptable for 
determining the number of factors for the model. 

2.1.3. Selecting the Factors Based on SECV 

To avoid building a model that is either overfit or underfit, the number of factors where 
the PRESS plot reaches a minimum would be the obvious choice of the best model (except in the 
case of Self-Prediction). While the minimum of the PRESS may be the best choice for predicting 
the particular set of samples, it is not always optimum for prediction of all unknown samples in 
the future. 

The concepts of SECV (Standard Error of Cross Validation) or SEP (Standard Error of 
Prediction) can be better utilized than PRESS to select the optimal number of factors. 
The definition of SECV is: 

$$\mathrm{SECV} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( Y_{i(k)} - Y_{i(p)} \right)^2} \qquad \text{Eq. (2.2)}$$

where Y_i(k) is the known concentration, Y_i(p) is the predicted concentration, 
and n is the number of samples calculated. 

One notes that the SECV expression in Eq. (2.2) is comparable to PRESS. SECV is the square root of 
PRESS averaged over the samples, and thus it follows the same trend as PRESS does 
(in ThermoNicolet's TQ Analyst program, SECV is also called RMSECV, with RM standing for 
root mean). When PRESS reaches its minimum, SECV reaches its minimum as well. 
However, the SECV represents the prediction error of the calibration model better than 
PRESS does, because it is expressed in the units of concentration. Therefore, one may use SECV 
plots and values to indicate the optimized number of factors as the choice for the best model. 
However, for a calibration that is required to be both 
robust and accurate, it is customary to choose the number of factors corresponding to the 
minimum in the plot of Log (PRESS) against the number of factors. 

In Figure 2.1, which is the SECV vs. factor plot for soy tofu calibration development, one 
notices that for numbers of factors between and 15 the SECV decreases as each new factor is 
added to the model. This indicates that the model is underfit and there are not enough factors to 
completely account for the constituents of interest as long as the SECV decreases significantly. 
At some point, the SECV plot should reach a minimum (here at 6 factors), and then begin to increase 
again. At this point the model is beginning to add factors that contain uncorrelated noise, which is 
not related to the constituents of interest. When these extra "noise" vectors are included in the 
model, the model is overfit and its predictive ability is diminished. The number of factors at the 
minimum SECV value (here, 6) is thus the best choice for prediction. The correlation of 
calculated (predicted) protein percentage vs. actual protein percentage with 6 factors is plotted 
in Figure 2.2, and a correlation coefficient of 0.999 was reached. 
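
The selection rule of this section can be sketched in a few lines, assuming that SECV is taken as the square root of PRESS divided by the number of samples n (the root-mean-square reading suggested by the name RMSECV). The helper names and the PRESS values below are hypothetical.

```python
import math


def secv_from_press(press_values, n_samples):
    """SECV for each factor count, from single-constituent cross-validation PRESS."""
    return [math.sqrt(p / n_samples) for p in press_values]


def best_factor_count(secv_values):
    """Number of factors (counted from 1) at the minimum SECV."""
    return min(range(len(secv_values)), key=secv_values.__getitem__) + 1


# Hypothetical PRESS(1)..PRESS(5) from a cross validation of 10 samples.
press_curve = [40.0, 10.0, 2.5, 4.0, 9.0]
secv_curve = secv_from_press(press_curve, n_samples=10)
n_best = best_factor_count(secv_curve)
print(secv_curve, n_best)
```

Since the square root is monotonic, the minimum of SECV always falls at the same factor count as the minimum of PRESS, in line with the remark made above.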

2.1.4. Outlier sample detection 

Outlier detection is as important as choosing the optimum number of factors for the 
model. If one or more of the training samples are in error, they will cause errors in the calibration 
model and ultimately poor prediction results for unknowns. Outlier samples usually arise from 
some incorrect measurement, whether it is in the concentration data (e.g. errors in the primary 
calibration techniques, transcription errors), or in the spectral data (e.g. spectrometer error, 
sample handling procedures, environmental conditions such as temperature and humidity). 
Including outlier samples in the training set will introduce a bias into the final model. In effect, 
outlier samples will tend to "pull" the model in their direction, causing the predicted 
concentrations of valid samples to be less accurate (or even erroneous) than if the outliers were 
completely eliminated from the training set. 

Samples that have significantly larger concentration residuals (differences between the 
actual and predicted concentrations) than the rest of the training set are known as concentration 
outliers. This type of outlier generally arises when the experimenter makes a mistake in 
creating the calibration mixtures, or when there was an error in the analysis of the samples by the 
primary calibration techniques used to generate the calibration concentration values. Another 
possibility which frequently occurs is a transcription error: the analyst simply types in the wrong 
concentration value when building the computerized training set. Some obvious outliers can be 
picked up simply by visual inspection. While the human eye is excellent at discerning patterns in 
data, visual inspection is not always a valid basis for a decision of this type. What is really 
needed is a mathematical way to determine accurately the likelihood that a sample is really an 
outlier. For clusters of data points, it is possible to use a measure of the Mahalanobis distance 
(Mahalanobis, 1936). This is calculated as the distance of the potential outlier sample point as 
measured from the mean of all the remaining points in the cluster. The distance is scaled for the 
range of variation in the cluster in all dimensions, which assigns a probability weight to the 
sample in terms of standard deviations. Any sample which lies more than 3 standard deviations 
from the mean can be considered suspicious, e.g. a 3% deviation for soy and health food 
composition. The Mahalanobis distance is also useful in qualitative analysis of spectral data for 
which the constituent concentrations are not known. 
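
The distance described above, measured from the mean of the remaining points and scaled by their spread, might be computed as in the following sketch. The function name and the two-dimensional points are hypothetical; in practice the coordinates could be, for example, factor scores or residuals of the training samples.

```python
import numpy as np


def mahalanobis_leave_one_out(points):
    """Distance of each row from the mean of the other rows, scaled by their covariance."""
    points = np.asarray(points, dtype=float)
    n = points.shape[0]
    distances = np.empty(n)
    for i in range(n):
        rest = np.delete(points, i, axis=0)
        diff = points[i] - rest.mean(axis=0)
        cov = np.atleast_2d(np.cov(rest, rowvar=False))
        # d^2 = diff' * inverse(cov) * diff, computed without forming the inverse.
        distances[i] = np.sqrt(diff @ np.linalg.solve(cov, diff))
    return distances


# Hypothetical cluster of seven points plus one sample far from the rest.
pts = [[0.1, 0.2], [-0.2, 0.1], [0.0, -0.1], [0.2, -0.2],
       [-0.1, 0.0], [0.1, 0.1], [-0.1, -0.1], [3.0, 2.5]]
d = mahalanobis_leave_one_out(pts)
suspect = int(np.argmax(d))
print(suspect, d[suspect])
```

The distance is expressed in units of standard deviation of the remaining cluster, so it can be compared directly with the threshold of 3 mentioned in the text.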

2.2. Spectra pre-processing 

One of the major problems in applying chemometric models to spectra is the fact that the 
acquired spectrum of a sample depends on many different, sometimes uncontrollable factors. 
For example, samples of powdered solids are usually measured by diffuse reflectance. Light 
scattering off the particles causes every spectrum, even a remeasurement of the same sample, to 
look slightly different because of the particle size distribution and alignment with the incident beam 
of light. While the quantitative information related to the constituents is still contained within the 
spectral data, it may not be immediately apparent. Another example is that the pathlength of the 
samples sometimes cannot be controlled, for instance when measuring spectra of thin films. 

Figure 2.1. SECV vs. factor plot for protein in the calibration development for soy tofu. 

Figure 2.2. Calculated (or predicted) protein% vs. actual (or reference) protein% plot, with 6 
factors, in the calibration development for soy tofu. 

Chemometric models can sometimes correct for these effects by adding extra loading 
vectors, but generally the models will perform better if the effects are removed or at least 
minimized before running the data through the calculations. Since such corrections are applied to 
the data before it is used in the model, they are often called preprocessing algorithms. There are a 
variety of methods that can be used to remove the non-constituent related aberrations in the data. 
Most algorithms are targeted at removing a specific interference (MSC, for example, specifically 
attempts to remove the effects of light scattering). Properly applying preprocessing requires 
understanding the interference in the data and selecting the appropriate algorithms to correct the 
effects. 

2.2.1 Multiplicative Scatter Correction (MSC) 

The NIR detector receives light coming from the sample in three forms: diffuse reflectance 
after absorption, specular reflection, and scattered light. Only the diffuse reflectance contains 
chemical composition information, whereas the latter two do not. Therefore, in order to 
determine chemical composition accurately from NIR measurements, the light scattering and 
specular components must be corrected for (Williams and Norris, 1987). 

The degree of scattering depends on the wavelength of the light that is used, and is not 
uniform throughout the spectrum. Typically, this appears as a baseline shift, tilt and sometimes 
curvature. The light scattering effect does not simply cause measurement errors. 
In early research on scatter correction for NIR reflectance spectra of meat (Geladi et al., 
1985), the reflectance for fat showed completely different tendencies (up and down) before and after 
MSC correction (see Figure 9 on page 498 of Geladi et al., 1985). Therefore, without MSC 
correction, the raw reflectance or absorbance values can produce a totally incorrect calibration, and 
lead to wrong predictions for unknown samples. The MSC method assumes that the wavelength 
dependency of the light scattering is different from that of the constituent absorption. 
Theoretically, by using data from many wavelengths in the spectrum, it should be possible to 
separate the two. 



This method attempts to remove the effects of scattering by linearizing each spectrum to 
some "ideal" spectrum of the sample (Galactic, 1996). MSC begins with a calculation of the 
average spectrum from all the data in the training set and uses it as the "ideal" spectrum. 
Thereafter, the spectral responses in each spectrum are used to calculate a linear regression 
against the corresponding points in the ideal spectrum. The offset from this regression is 
subtracted from the original training spectrum and the result is divided by the slope, to give the 
MSC corrected spectrum. 

Linear Regression: 

$$A_i = m_i \bar{A} + b_i$$

MSC Correction: 

$$A_{i(\mathrm{MSC})} = \frac{A_i - b_i}{m_i} \qquad \text{Eq. (2.4)}$$

In these equations, A is the n by p matrix of training set spectral responses for all the 
wavelengths, A-bar is a 1 by p matrix of the average responses of all the training set spectra at 
each wavelength, A_i is a 1 by p matrix of the responses for a single spectrum in the training set, n 
is the number of training spectra, and p is the number of wavelengths in the spectra. The m_i and 
b_i values are the slope and offset coefficients of the linear regression of the A_i spectrum vector 
on the mean spectrum vector A-bar. By adjusting the slope and offset of the sample spectra to 
the "ideal" average spectrum, the chemical information is preserved while the differences 
between the spectra are minimized. Thus, the major source of random variance between them 
can be removed as much as possible. 
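
A minimal sketch of the MSC calculation follows, assuming the usual form in which each spectrum is regressed on the mean spectrum, the offset is subtracted and the result is divided by the slope. The function name `msc` and the test spectra are hypothetical.

```python
import numpy as np


def msc(spectra, reference=None):
    """Multiplicative scatter correction of the rows of `spectra`.

    Each spectrum a is fitted as a ~ m * reference + b and replaced by (a - b) / m.
    The reference defaults to the mean spectrum of the set.
    """
    A = np.asarray(spectra, dtype=float)
    ref = A.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(A)
    for i, a in enumerate(A):
        m, b = np.polyfit(ref, a, 1)   # slope and offset of the linear regression
        corrected[i] = (a - b) / m
    return corrected


# Hypothetical example: one band measured three times with different
# multiplicative (slope) and additive (offset) scatter effects.
x = np.linspace(0.0, 1.0, 40)
band = np.exp(-((x - 0.5) / 0.1) ** 2)
raw = np.vstack([1.0 * band + 0.0, 1.5 * band + 0.2, 0.7 * band - 0.1])
corrected = msc(raw)
print(np.ptp(corrected, axis=0).max())   # spread between spectra after correction
```

In this idealized case the three corrected spectra coincide with the mean spectrum. With real samples the differences that remain after correction carry the chemical information.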



2.2.2. Correcting Baseline Effects: 

Spectrometers rarely collect data with an ideal baseline. In order to calculate 
concentrations accurately, it is necessary to remove the baseline shift introduced by the 
spectrometer, especially by specular reflectance in the reflectance mode of PerkinElmer's 
Spectrum One NTS. 
There are a number of methods used by spectroscopists to remove baseline effects from the 
spectra they collect. The problem with most methods is that they require the spectroscopist to 
decide by visual inspection that the baseline is corrected. However, there are some methods 
which are automated enough to be used as part of a calibration model, such as Linear 
Regression Baseline Fitting, the Two Point Linear Baseline approach, and Derivatives. In 
PerkinElmer's Spectrum program, a special function "Interactive Baseline Correction" is 
designed for users to correct the baseline shift of raw spectra, and another function "Normalization" 
is used to normalize spectra so that the absorbance values can be used correctly to fit Beer's Law 
for matrix calculations. 
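
Of the automated methods named above, the two point linear baseline is the simplest to write down. The sketch below is a generic illustration and is not the "Interactive Baseline Correction" function of the Spectrum program, whose internals are not described here. The function name, anchor positions and test spectrum are hypothetical.

```python
import numpy as np


def two_point_baseline_correct(x, y, x_left, x_right):
    """Subtract the straight line through the spectrum at two anchor positions."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    i = np.argmin(np.abs(x - x_left))    # nearest data point to each anchor
    j = np.argmin(np.abs(x - x_right))
    slope = (y[j] - y[i]) / (x[j] - x[i])
    baseline = y[i] + slope * (x - x[i])
    return y - baseline


# Hypothetical spectrum: one absorption band on a tilted, shifted baseline.
wavenumber = np.linspace(4000.0, 12000.0, 401)
band = np.exp(-((wavenumber - 8000.0) / 200.0) ** 2)
measured = band + 1.0e-5 * wavenumber + 0.3
flattened = two_point_baseline_correct(wavenumber, measured, 4000.0, 12000.0)
print(flattened[0], flattened[-1], flattened.max())
```

The anchors should sit where the sample does not absorb. If they fall on a band, part of that band is subtracted along with the baseline.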

2.3. Computer iteration steps for calibration development with PLS-1 

The calibration involves regression with a Partial Least Squares Type 1 (PLS-1) multivariate 
algorithm (Galactic Industries Corporation, 1996). The collection of known data, or 
chemical composition, for each standard sample, together with the data measured by the 
instrument, is called a calibration set (or training set). Calibration algorithms such as PLS-1 
base their predictions of each constituent concentration on changes in the spectral data rather 
than on absolute absorbance values. A simpler algorithm called "NIPALS" is useful to illustrate the 
iteration procedures followed in PLS-1 as well. The NIPALS algorithm involves two stages: an 
iterative stage that utilizes just the NIR spectral data and a regression stage that utilizes the 
laboratory composition data along with the results from the previous stage. The first iteration 
stage begins by computing the difference between each raw spectrum and the mean spectrum, 
A - A-bar, for the entire calibration set. A set of factors F, or eigenvectors F_i, is then iterated by 
setting such factors at the beginning to be equal to the raw spectra, A_i. Both A and F are 
represented as tables (or matrices) of the NIR absorbance values at specific wavelengths across 
the NIR spectrum of soybeans. From these matrices, one calculates tables (or matrices) of scores, 
S_i, defined as a product of two matrices: 

$$S_i = A_i F_i' \qquad \text{Eq. (2.6)}$$

where F_i' is the transposed matrix of the eigenvector F_i. In a second iteration step, the 
eigenvectors F_i are normalized by dividing by the corresponding eigenvalues, λ_i, defined 
as: 

$$\lambda_i = \left( \sum S_i^2 \right)^{1/2} \qquad \text{Eq. (2.7)}$$

Thus F_i = A_i / λ_i are the normalized eigenvectors at this second iteration step. A new set of 
scores is then calculated with equation (2.6) from the normalized eigenvectors. The new set of 
scores is subtracted from the corresponding ones obtained at the first iteration step. The iteration 
is complete when this difference is zero or negligible. If the difference is significant, one 
re-iterates the eigenvectors F_i through matrix multiplication: 

$$F_i = (A_i - \bar{A})' \times S_i \qquad \text{Eq. (2.8)}$$

until the difference between two values of S_i for consecutive iterations becomes zero or 
negligible. Such optimized scores are effectively the absorbance values of individual constituents 
at selected wavelengths across the NIR spectrum of the soybeans. 
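
The iterative stage can be illustrated with the common textbook form of NIPALS for a single factor. This is a sketch under assumptions: the eigenvector is normalized to unit length at each pass, which may differ in detail from the normalization described above, and the function name and data are hypothetical.

```python
import numpy as np


def nipals_first_factor(A, tol=1e-10, max_iter=500):
    """First eigenvector (loading) f and scores s of the mean-centered spectra."""
    X = A - A.mean(axis=0)                        # A - A_bar
    s = X[:, np.argmax(X.var(axis=0))].copy()     # starting guess for the scores
    f = None
    for _ in range(max_iter):
        f = X.T @ s                               # compare Eq. (2.8)
        f /= np.linalg.norm(f)                    # normalize the eigenvector
        s_new = X @ f                             # compare Eq. (2.6)
        converged = np.linalg.norm(s_new - s) < tol * np.linalg.norm(s_new)
        s = s_new
        if converged:                             # scores no longer change
            break
    return s, f


# Hypothetical data: one dominant band with varying concentration plus noise.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 50)
band = np.exp(-((x - 0.4) / 0.05) ** 2)
conc = rng.uniform(0, 1, 10)
A = 5.0 * np.outer(conc, band) + rng.normal(0, 0.01, (10, 50))
scores, loading = nipals_first_factor(A)
print(np.linalg.norm(loading))
```

Further factors would be obtained by subtracting the outer product of the scores and the eigenvector from the centered spectra and repeating the iteration on the remainder.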

The tables of those score values obtained at the first stage are then employed in a second 
stage to relate the absorbance values of individual constituents to the known chemical 
composition stored as a chemical composition table, or matrix, C. The model equation at this 
stage is therefore: 

$$C = B \times S + E_c \qquad \text{Eq. (2.9)}$$

where B is the regression coefficient matrix and E_c is a matrix of regression error terms for the 
chemical composition of the constituents. Once the regression coefficients in matrix B are 
determined, the calibration is complete and can be utilized to predict composition values for the 
constituents of unknown samples. 

In the PLS-1 algorithm, an added sophistication is introduced by utilizing, from the first 
pass of the iteration, a linear combination of calibration spectra weighted by the corresponding 
concentrations of one constituent at a time. In this procedure, the loading vectors (sometimes 
called "spectral weighting vectors") are defined as: 

$$W_j = C_j' A \qquad \text{Eq. (2.10)}$$

where C_j is the composition vector for constituent j. At the next iteration pass, these spectral 
weighting vectors are normalized as follows: 

$$W_j(\text{pass 2}) = \frac{W_j(\text{pass 1})}{W_j(\text{pass 1}) \, W_j'(\text{pass 1})} \qquad \text{Eq. (2.11)}$$

Therefore, by using loading vectors as eigenfactors, concentration information is included 
in the calculations during the first spectral decomposition stage rather than in a separate second 
stage. This is the main difference between PLS and NIPALS (and also the PCR method). 
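
The idea of concentration-weighted loading vectors can be made concrete with a compact PLS-1 sketch in its common textbook form. It is not taken from the Galactic or PerkinElmer software: the weight vector is scaled to unit length (the usual convention, assumed here), and the function names and synthetic spectra are hypothetical.

```python
import numpy as np


def pls1_fit(A, c, n_factors):
    """Fit a PLS-1 model for one constituent; returns (a_mean, c_mean, b)."""
    a_mean, c_mean = A.mean(axis=0), c.mean()
    X, y = A - a_mean, c - c_mean
    W, P, q = [], [], []
    for _ in range(n_factors):
        w = X.T @ y                   # concentration-weighted spectra, cf. Eq. (2.10)
        w /= np.linalg.norm(w)        # scale the weight vector to unit length
        t = X @ w                     # scores for this factor
        tt = t @ t
        p = X.T @ t / tt              # spectral loading vector
        qk = (y @ t) / tt             # regression of concentration on the scores
        X = X - np.outer(t, p)        # remove the modeled part before the next factor
        y = y - qk * t
        W.append(w); P.append(p); q.append(qk)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)   # regression vector in spectral space
    return a_mean, c_mean, b


def pls1_predict(model, A_new):
    a_mean, c_mean, b = model
    return (A_new - a_mean) @ b + c_mean


# Hypothetical noise-free mixture of two components with separate bands.
rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 50)
s1 = np.exp(-((x - 0.3) / 0.05) ** 2)
s2 = 3.0 * np.exp(-((x - 0.7) / 0.05) ** 2)
c1, c2 = rng.uniform(0, 1, 12), rng.uniform(0, 1, 12)
A = np.outer(c1, s1) + np.outer(c2, s2)
model = pls1_fit(A, c1, n_factors=2)
c1_pred = pls1_predict(model, A)
print(np.abs(c1_pred - c1).max())
```

Because the weights are built from the concentrations in the very first pass, the first factor already points toward the constituent of interest, in contrast to PCR, where the first factor simply follows the largest spectral variance.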



Loading factors approximate the pure component spectra. The first loading 
factor in the PLS-1 analysis is a first-order approximation to the pure-component spectrum of the 
corresponding component. Figure 2.3 gives a graph of the first loading factors for the pure 
components in an SPI and H2O mixture. The pure component spectra of SPI and H2O generated by 
the computer program look exactly the same as their real spectra. 

The number of calibration loading factors for each constituent can be obtained from the 
minimum value of the SECV. However, for a calibration that is required to be both robust and 
accurate, it is customary to choose the number of factors corresponding to the minimum in the 
plot of Log (PRESS) against the number of factors. 

2.4. Standard Error of Prediction (SEP) 

Standard Error of Prediction (SEP) has the same definition as SECV, but the samples for 
SEP are not involved in the cross validation process for calibration development. The samples 
for SEP are only used to compare predicted values from the developed calibration with known 
values for calibration validation purposes. 
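As a minimal sketch, SEP can be computed as the root-mean-square difference between predicted and reference values for samples held out of the calibration. The exact formula is an assumption here (some definitions subtract the bias or divide by n - 1), and the numbers below are hypothetical rather than data from this study.

```python
import math

def rms_error(reference, predicted):
    """Root-mean-square difference between reference and predicted values."""
    n = len(reference)
    return math.sqrt(sum((r - p) ** 2 for r, p in zip(reference, predicted)) / n)

# Hypothetical protein values (%) for four validation samples that were
# NOT used in the cross-validation step.
ref = [7.1, 10.0, 20.0, 25.0]
pred = [7.5, 9.4, 20.8, 24.6]
sep = rms_error(ref, pred)
```

The same function applied to cross-validation predictions of the calibration samples would give an SECV-like figure; only the choice of samples differs.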

3. Experimental results and data analysis 

3.1. NIR analysis of soy and other health foods 
3.1.1. Sampling and experiments 

FT-NIRS measurements were carried out in quadruplicate for 16 types of food 
samples, such as: soy crisps, dry roasted soy nuts, soy burgers, soy tofu, island black beans, 
soymilk powder, rye cakes, rye bread, rye toast, rye cocktail bread, dry tomato, popcorn 
minicakes, biscuits and lean ham. 



Figure 2.3. A graph of loading factors for the pure components in an SPI and H₂O mixture.



Their composition values were calculated according to the nutrition tables on those products and
used as calibration data, which are listed in Table 1. The other standard samples were prepared
by either dehydrating or rehydrating some of the original samples. The total number of samples
used for this calibration development was 28. FT-NIR spectra were collected over a spectral
range from 4000 to 12000 cm⁻¹ (833 to 2500 nm) at a resolution of 8 cm⁻¹ with a PerkinElmer
FT-NIR spectrometer, model Spectrum One NTS. This spectrometer is optimized for
high-sensitivity analysis of solid samples, being equipped with an NIRA integrating sphere
accessory and an extended-range InGaAs detector. The beam size was set to 8.94 mm. The
number of scans was 64 for each spectrum.
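The wavenumber and wavelength limits quoted above are related by λ (nm) = 10⁷ / ν (cm⁻¹). The short Python check below illustrates the conversion; the function name is hypothetical.

```python
def wavenumber_to_nm(wavenumber_cm1):
    """Convert a wavenumber in cm^-1 to a wavelength in nm (1 cm = 1e7 nm)."""
    return 1.0e7 / wavenumber_cm1

low_nm = wavenumber_to_nm(12000)   # short-wavelength end, about 833 nm
high_nm = wavenumber_to_nm(4000)   # long-wavelength end, 2500 nm
```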

Table 1. Composition values of 16 soy and other health foods calculated according to the 
nutrition tables on those products. 

Sample                  Protein%   Fat%    Moisture%   Total Carbohydrates%   Fiber%

Soy crisps              25.0       7.1     0.5         50.0                   7.1
Dry roasted soy nuts    43.3       26.7    <1.0        20.0                   13.3
Soy burgers             20.0       4.4     59.7        8.9                    5.6
Frida's firm tofu       7.1        3.5     82.2        2.4                    1.2
Fried tofu              9.1        8.6     76.1        2.0                    1.0
Island black beans      18.8       1.6     5.0         53.1                   18.8
Soymilk powder          10.0       5.9     1.7         69.1                   0.1
Popcorn minicakes       12.5       6.3     <1.0        75.0                   6.3
Rye cakes               13.0       <0.1    1.0         60.0                   26.0
Rye bread               9.7        4.8     26.0        45.2                   6.5
Light rye bread         7.3        3.7     22.0        48.8                   2.4
Rye toast               10.0       <0.1    <1.0        85.0                   3.0
Biscuits                10.0       <0.1    <1.0        85.0                   2.0
Dry tomato              <1.0       <0.1    4.0         86.0                   9.0
Rye cocktail bread      9.7        4.8     26.0        45.2                   6.5
Bohllen lean ham        14.5       14.2    68.8        1.8                    <0.1

3.1.2. Calibration results 

The TQ Analyst software developed by Nicolet Instruments was employed to process
NIR spectra and develop calibration files. A total of 112 FT-NIR spectra were preprocessed by
applying a suitable Multiplicative Scattering Correction (MSC). Partial Least Squares Type 1
(PLS-1) multivariate regression analyses were employed to develop high-quality calibration
models. Figure 4 (2.4) shows an overlay of the group spectra for soy and other health foods
obtained with Spectrum One NTS after baseline correction and normalization. Standard
composition values of the major food components (protein, fat, moisture, fiber and total
carbohydrates) were obtained from the nutrition tables on those products. Composition changes of
soy and other health foods caused by microwave heating or moisture rehydration were also
monitored. The composition ranges for calibration development are: protein 0.5% to 43.3%, fat
0.1% to 26.7%, moisture 0.5% to 82.2%, fiber 0.1% to 26%, total carbohydrates 0.5% to 95%.
These are quite wide concentration ranges and cover almost all of the compositions found in soy
and other health foods. The optimized parameters for the calibration are listed in Table 2.

Table 2. Optimized SECV, R² values and number of factors for the calibrations developed on
Spectrum One NTS, spectral range 4080 to 11200 cm⁻¹.

              Protein%   Fat%    Moisture%   Total Carbohydrates%   Fiber%

SECV          1.2        0.7     1.4         1.7                    1.0
R²            0.992      0.994   0.995       0.995                  0.985
# Factors     12         12      13          14                     13
SEP           1.4        1.0     1.6         1.7                    1.3

This calibration for soy and other health foods is characterized by low standard errors
(~1%) and high degrees of correlation between NIR calculated values and laboratory reference
values (~99%). It is adequate for the commercial determination of nutritional contents in soy and
other health foods. The purpose of developing this calibration is to introduce a new experimental
method for rapidly and accurately measuring different types of soy and other health foods. The
results were reported as (see Appendix) "Determination of Soy and Other Health Foods
Composition by Fourier Transform Near Infrared Reflectance Spectroscopy", Jun Guo
and Ion C. Baianu, Proceedings for the 9th Biennial Conference of the Cellular and Molecular
Biology of the Soybean, August 11-14, 2002, p. 506.

Figure 4. Overlay of FT-NIR reflectance spectra for soy and other health foods obtained
with Spectrum One NTS.

3.2. NIR analysis of soy tofu 

3.2.1. Sampling and experiments 

FT-NIRS measurements were carried out in quadruplicate for 19 tofu samples with
different protein and water contents. The original tofu sample was a commercial product, Frida's
Firm Tofu, with 7.1% protein, 82.2% water, and ~10% other total solid components such as fat,
salts and carbohydrates. The other samples were prepared by brief microwave heating in
20-second intervals, so that the water in the tofu was lost gradually and the protein content
increased accordingly. The total number of samples used for this calibration development was
24. The composition values were calculated according to the amount of water lost. The
composition ranges for calibration development are: protein 7.1% to 39.8%, moisture 27.1% to
82.2%, and other total solids 10.7% to 33.1%. These are quite wide concentration ranges and
cover almost all of the soft and firm tofu compositions. FT-NIR spectra were collected over a
spectral range from 4000 to 12000 cm⁻¹ (833 to 2500 nm) at a resolution of 8 cm⁻¹ with Spectrum
One NTS. The beam size was set at 8.94 mm. The number of scans was 64 for each spectrum.

3.2.2. Calibration results 

The TQ Analyst software was employed to process NIR spectra and develop calibration
files. A total of 96 FT-NIR spectra (shown in Figure 5) were preprocessed by applying a suitable
Multiplicative Scattering Correction (MSC). Partial Least Squares Type 1 (PLS-1) multivariate
regression analyses were employed to develop high-quality calibration models.
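For readers unfamiliar with the MSC preprocessing step, the following Python sketch shows the commonly used form of the correction: each spectrum is regressed against the mean spectrum, and the fitted offset and slope are then removed. This is an illustration under that assumption, not the TQ Analyst implementation, and the function name and toy spectra are hypothetical.

```python
def msc(spectra):
    """Multiplicative scatter correction against the mean spectrum.

    spectra: list of spectra, each a list of p intensity values.
    Returns the corrected spectra in the same layout.
    """
    n = len(spectra)
    p = len(spectra[0])
    mean = [sum(row[k] for row in spectra) / n for k in range(p)]
    m_bar = sum(mean) / p
    s_mm = sum((m - m_bar) ** 2 for m in mean)
    corrected = []
    for row in spectra:
        x_bar = sum(row) / p
        # Least-squares fit: row ~ offset + slope * mean
        slope = sum((m - m_bar) * (x - x_bar) for m, x in zip(mean, row)) / s_mm
        offset = x_bar - slope * m_bar
        corrected.append([(x - offset) / slope for x in row])
    return corrected

# Toy spectra (illustrative only): the same band shape recorded with
# different multiplicative and additive scatter effects.
base = [0.10, 0.30, 0.80, 0.40, 0.20]
raw = [[1.0 * b + 0.00 for b in base],
       [1.5 * b + 0.10 for b in base],
       [0.7 * b - 0.05 for b in base]]
clean = msc(raw)
```

In this idealized case all three corrected spectra coincide, because they differ only by scatter-like offset and scaling; in real data the remaining differences carry the chemical information used by the PLS-1 regression.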

The optimized parameters for the calibration are listed in Table 3.



Figure 5. Overlay of FT-NIR reflectance spectra for soy tofu obtained with Spectrum One.



Table 3. Optimized SECV, R² values and number of factors for the calibrations developed on
Spectrum One NTS, spectral range 4080 to 11000 cm⁻¹.

              Protein%   Moisture%

SECV          0.75       1.19
R²            0.999      0.998
# Factors     10         8
SEP           0.83       1.35

The calibration for soy tofu is characterized by low standard errors (~1%) and high
degrees of correlation between NIR calculated values and laboratory reference values (~99%),
and can be used to measure protein content in tofu.

3.3. NIR analysis of soy milk 

3.3.1. Sampling and experiments 

FT-NIRS measurements were carried out in quadruplicate for 27 soymilk samples with
different protein and water contents. The liquid soymilk samples were made from a commercial
soymilk powder product, Mount Elephant Soybean Drink (Guangxi Cereal and Oil Product
Company, Wuzhou City, Guangxi Province, China), with 10% protein and 69% carbohydrates.
Liquid soymilk samples of different concentrations were prepared by mixing the soymilk powder
with different proportions of water. FT-NIR spectra were collected over a spectral range from
4000 to 12000 cm⁻¹ (833 to 2500 nm) at a resolution of 8 cm⁻¹ with Spectrum One NTS. The
beam size was set to 8.94 mm. The number of scans was 64 for each spectrum.
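The composition of each diluted sample follows from a simple mass balance. The Python sketch below is an illustration only: it assumes the powder composition listed for soymilk powder in Table 1 (10.0% protein, 69.1% carbohydrates, 1.7% moisture) and a hypothetical mixing ratio; the function name is also hypothetical.

```python
# Powder composition in percent by weight (values for soymilk powder in Table 1).
POWDER = {"protein": 10.0, "carbohydrates": 69.1, "water": 1.7}

def diluted_composition(powder_g, water_g):
    """Percent composition of a powder/water mixture from a mass balance."""
    total = powder_g + water_g
    # Solids are simply diluted by the added water.
    out = {name: pct * powder_g / total
           for name, pct in POWDER.items() if name != "water"}
    # Water comes from the powder's own moisture plus the added water.
    out["water"] = (POWDER["water"] * powder_g + 100.0 * water_g) / total
    return out

# Hypothetical example: 1 g of powder mixed with 19 g of water.
mix = diluted_composition(powder_g=1.0, water_g=19.0)
```

For this 1:19 ratio the mixture contains about 0.5% protein and about 3.5% carbohydrates.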

Because water is the dominant component in soymilk, the protein bands in the
soymilk spectra are overlapped by very strong water bands. To obtain as much chemical
information as possible on the components other than water, a specially designed metal
reflector was used to obtain transflectance spectra. Only 5 µl of liquid sample was placed on
the instrument each time, with the reflector covering the top of the liquid layer so that
diffuse reflectance signals were not lost.

3.3.2. Calibration results 

The TQ Analyst software was employed to process NIR spectra and develop calibration
files. A total of 108 FT-NIR spectra (see Figure 6) were preprocessed by applying a suitable
Multiplicative Scattering Correction (MSC). Partial Least Squares Type 1 (PLS-1) multivariate
regression analyses were employed to develop high-quality calibration models. The composition
ranges for calibration development were: protein 0.5% to 10%, water 1.7% to 100%, and
carbohydrates 3.5% to 69.1%. These are quite wide concentration ranges and cover almost all of
the soymilk and even tofu compositions. The optimized parameters for the calibration are listed
in Table 4.

Table 4. Optimized SECV, R² values and number of factors for the calibration of soymilk
developed on Spectrum One NTS, spectral range 4080 to 11500 cm⁻¹.

              Protein%   Fat%    H₂O%    Carbohydrates%

SECV          0.03       0.02    0.34    0.23
R²            0.999      0.999   0.999   0.999
# Factors     9          9       9       9
SEP           0.08       0.04    0.73    0.53

The calibration for soy milk is characterized by low standard errors, especially for protein
and fat (<0.1%), and high degrees of correlation between NIR calculated values and laboratory
reference values (~99%). It is suitable for measuring soymilk within regular concentration
ranges. The results were reported as (see Appendix A) "Rapid Determinations of Soybean
Isoflavones, Soy and Other Health Foods Composition by Fourier Transform Near Infrared
Reflectance Spectroscopy", Jun Guo and Ion C. Baianu, Proceedings of the China and
International Soy Conference and Exhibition 2002 (CISCE 2002), November 6-9, 2002, 391-392.



Figure 6 (2.6). Overlay of 108 FT-NIR transflectance spectra for soymilk obtained with
Spectrum One NTS.



3.4. NIR analysis of soybean isoflavones 

3.4.1. Sampling and experiments 

In order to develop NIR calibrations on such instruments for soybean composition
analysis, soybean standard samples were selected from the USDA Soybean Germplasm
Collection (Urbana, IL, USA). The selection of standard samples was based on their protein, oil,
moisture, and isoflavone contents, to ensure that the ranges of the standard sample constituent
contents covered the full range of possible constituent variations of samples. To minimize
screening effects of the soybean seed coat (especially black and brown coats) on isoflavones,
soybean seeds were ground for the preparation of standard samples.

Twenty-eight ground soybean samples from isoflavone standards plus one isoflavone
tablet sample (NovaSoy tablet) were utilized in the calibration development, with isoflavones
ranging from 0.04% to 0.9% (HPLC data), protein from 34% to 47.1% (ZX-50 data), oil
from 12.8% to 23% (ZX-50 data), and moisture from 5.6% to 11.0% (ZX-50 data).
Laboratory reference values of isoflavone composition were obtained by HPLC analyses of
soybeans, which were kindly provided by Dr. J. Widholm's laboratory at UIUC. The TQ Analyst
software was employed to process NIR spectra and develop calibration files. A total of 116 FT-NIR
spectra were preprocessed by applying a suitable Multiplicative Scattering Correction (MSC).
Partial Least Squares Type 1 (PLS-1) multivariate regression analyses were employed to develop
high-quality calibration models.

The samples were ground with a Braun KSM2B Grinder. The average grinding time was 25
seconds, producing a powder sample with a particle size ranging from 100 µm to 200 µm.
Quadruplicate FT-NIRS measurements were carried out for the 29 isoflavone samples with a weight
of 300 mg each (two soybean seeds). FT-NIR spectra were collected over a spectral range from
4000 to 12000 cm⁻¹ (833 to 2500 nm) at a resolution of 8 cm⁻¹ with Spectrum One NTS. The
beam size was set at 8.94 mm. The number of scans was 32 for each spectrum.

3.4.2. Calibration results 

Figure 7 (2.7) shows an overlay of FT-NIR spectra of ground soybeans for isoflavone
standards. They are baseline corrected and normalized. A calibration was developed based on
these spectra. Standard composition values were obtained with a ZX-50 instrument for protein,
oil and moisture, and from HPLC data for isoflavones.

The optimized parameters for the calibration are listed in Table 5. Figure 8 (2.8)
presents the calibration plot for calculated isoflavones% vs. actual (or reference)
isoflavones%, with 9 factors. The correlation coefficient R and SECV (RMSEC) values are also
listed in the figure.
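The correlation coefficient quoted for such a calibration plot can be computed as the Pearson correlation between calculated and reference values. The Python sketch below is a generic illustration; the function name is hypothetical and the isoflavone values are made-up numbers, not data from this study.

```python
import math

def pearson_r(reference, calculated):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(reference)
    mx = sum(reference) / n
    my = sum(calculated) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(reference, calculated))
    sxx = sum((x - mx) ** 2 for x in reference)
    syy = sum((y - my) ** 2 for y in calculated)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical isoflavone contents in percent (illustrative only).
actual = [0.04, 0.20, 0.45, 0.70, 0.90]
calc = [0.06, 0.19, 0.47, 0.68, 0.91]

r = pearson_r(actual, calc)
r_squared = r * r
```

A high R alone does not guarantee accuracy at the low end of the range, which is why the standard errors are reported alongside it.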

Table 5. Optimized SECV, R² values and number of factors for the calibration of soybean
isoflavones developed on Spectrum One NTS, spectral range 4100 to 10625 cm⁻¹.

              Protein%   Oil%    H₂O%    Isoflavones%

SECV          0.67       0.28    0.12    0.0146
R²            0.989      0.994   0.995   0.997
# Factors     6          9       9       9
SEP           0.43       0.32    0.26    0.0172



This calibration for soybean isoflavones is characterized by low standard errors (<0.02%)
and high degrees of correlation between NIR calculated values and laboratory reference values
(~99%). For soybean samples with a normal isoflavone content, i.e. 0.2% to 0.9%, the
calibration can be applied accurately. For soybean samples with a low isoflavone content,
i.e. 0.04% to 0.2%, the calibration can only roughly predict the isoflavone concentration. The
accuracy of this calibration is comparable with that of a recently published calibration for
soybean isoflavones developed with single half soybean seeds (You et al., 2002). The results
were reported as (see Appendix) "Rapid Determinations of Soybean Isoflavones, Soy and
Other Health Foods Composition by Fourier Transform Near Infrared Reflectance
Spectroscopy", Jun Guo and Ion C. Baianu, Proceedings of the China and International Soy
Conference and Exhibition 2002 (CISCE 2002), November 6-9, 2002, 391-392.



Figure 7 (2.7). Overlay of FT-NIR reflectance spectra for soybean isoflavone standards
obtained with Spectrum One NTS.



Figure 8 (2.8). The calibration plot for calculated isoflavones% vs. actual (or reference)
isoflavones%, with 9 factors.



4. Conclusions 

PerkinElmer's Spectrum One NTS FT-NIR instrument can be utilized for accurate
measurements of food protein, oil (fat), carbohydrate and fiber contents for both solid and liquid
samples, as well as isoflavone contents in soybean powder samples. It can also be employed to
obtain detailed characterization of foods and to investigate the interactions between major food
components such as protein, oil, water and carbohydrates. Moreover, fast and economical
measurements of food composition allow online quality control and chemical analysis in food
production.
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